A comparative study of three models to analyze the impact of air pollutants on the number of pulmonary tuberculosis cases in Urumqi, Xinjiang

In this paper, we separately constructed ARIMA, ARIMAX, and RNN models to determine whether there exists an impact of the air pollutants (such as PM2.5, PM10, CO, O3, NO2, and SO2) on the number of pulmonary tuberculosis cases from January 2014 to December 2018 in Urumqi, Xinjiang. In addition, by using a new comprehensive evaluation index DISO to compare the performance of three models, it was demonstrated that ARIMAX (1,1,2) × (0,1,1)12 + PM2.5 (lag = 12) model was the optimal one, which was applied to predict the number of pulmonary tuberculosis cases in Urumqi from January 2019 to December 2019. The predicting results were in good agreement with the actual pulmonary tuberculosis cases and shown that pulmonary tuberculosis cases obviously declined, which indicated that the policies of environmental protection and universal health checkups in Urumqi have been very effective in recent years.


Introduction
Tuberculosis is a chronic respiratory disease mainly caused by Mycobacterium tuberculosis (M. tuberculosis), which can invade many organs of the human body, the most common is pulmonary tuberculosis (PTB) infection [1]. A total of tuberculosis with 833 thousand cases from China has been reported in 2020 [2]. The incidence of PTB in Xinjiang is the highest in China, with an incidence of 210.75/per100,000 from 2014 to 2018, and shown an overall increasing trend [3]. Urumqi, the capital of the Xinjiang, has the higher incidence of PTB (60/100,000) [4] with the incidence of new smear-positive PTB (14.31/100,000) than the national average level [5]. PTB is transmitted by breathing in droplet nuclei with single Mycobacterium tuberculosis in air from the cough or sneeze of active PTB infected persons [6]. It has been shown that there is a strong association between air pollutants and PTB incidence [7], for instance, PM 2.5 , with the features of being small, light, toxic, and suspended in the air for a long time, which may facilitate the transmission and development of PTB [8]. Oxidative stress and immune inflammatory response produced by human body could increase the risk of PTB [9] when PM 2.5 is deposited in the lungs. Because O 3 is the pollutant produced by NO 2 ultra-violet light, it can worsen lung function, exacerbate airway inflammatory responses and affect lung ventilation. Zhu et al. [10] studied the correlation between PTB incidence and PM 10 and NO 2 in Chengdu from 2010 to 2015 by using a distributed lag non-linear model. A generalized additive model was by Huang et al. [11] applied to analyze the effect of PM 2.5 , PM 10 , and O 3 on PTB incidence in Wuhan from 2015 to 2016. The positive correlation between the air quality index in Beijing and the incidence of PTB was analyzed in [12].
Urumqi, the economic, cultural, scientific, and transportation center of Xinjiang, is an important central city in Northwest China. It's surrounded on three sides by mountains and under the control of cold Mongolian high pressure in winter. This special valley topography would make the airflow difficult to flow horizontally and air pollutants difficult to diffuse and dilute. What's more, the heating period in Urumqi is lasting up to 180 days, and the main energy is coal, which can produce a large number of air pollutants. It has a certain impact on the incidence of PTB. Several studies have shown the effect of air pollution on the PTB cases in Urumqi. For example, an ARMA (1, (1, 3)) + model was established to analyze the correlation between air pollutants and the incidence of PTB in Urumqi from 2014 to 2017 and found that the higher concentration of O 3 , the higher PTB incidence [13]. Yang et al. [14] used a generalized additive model to analyze the relationship between air pollutants and PTB incidence and it was indicated that the combined effect of PM 10 and NO 2 had the greatest impact on the incidence of tuberculosis.
In order to estimate the relationship between variables described disease dynamics, one of the classical statistical approaches is the use of auto-regressive integrated moving average (ARIMA) model. This model is easy to be constructed, only requires intrinsic variables, and has relatively high prediction accuracy [15]. Therefore, it has been widely applied in the prediction of PTB incidence. As a generalized and improved ARIMA model, ARIMAX (Autoregressive Integrated Moving Average-X) model can take into consideration the dependence on time series and the disturbance of random fluctuations. By incorporating the exogenous variables into the ARIMA model, ARIMAX can effectively improve the prediction accuracy and accurately predict the short-term trend of the disease, and had been often applied in the prediction of some diseases, such as influenza [16], hand-foot-mouth disease [17], new crown pneumonia [18] and mumps [19], etc. Tuo et al. [20] separately established ARIMA and ARI-MAX models to analyze and predict the monthly influenza cases in Urumqi from 2013 to 2016. An ARIMAX model was by Li et al. [21] applied to analyze the impact of meteorological factors on the incidence of PTB in Kashgar from 2005 to 2014.
Except for time series analysis, deep learning models, such as Recurrent Neural Network (RNN) model, Long-short Term Memory (LSTM) model, Bi-Directional LSTM model and Gate Recurrent Unit (GRU) model, etc. can also be widely applied to forecast disease incidence [22,23]. RNN is the most common deep learning model which is proposed by Saratha Sathasivam in 1982 [22]. The internal structure of RNN model is simpler, and it can select fewer parameters than other deep learning models with complex structures (LSTM, GRU) [24]. LSTM and GRU models, variants of RNN model, could effectively capture the semantic association between long-term sequences, and alleviate the phenomenon of gradient disappearance [25][26][27]. Moreover, GRU could also reduce the network parameters compared with LSTM, and converge with faster speed [27]. Particularly, RNN is a sub-class of artificial neural network using hidden variables as a memory to capture temporal dependencies between system and control variables, which is more suitable for handling time series data [28]. So it is widely used to predict the incidence of various diseases, such as hepatitis [29], hands-foot-and-mouth disease [30], COVID-19 [31,32], dengue fever [33]. For example, Xia et al. [29] showed that RNN model is significant to forecast the Hepatitis incidence and have the potential to assist the decision-makers in making efficient decisions for the early detection of the disease incidents. Wang et al. [30] predicted the number of hands-foot-and-mouth cases of enterovirus A71 subtype in Beijing from 2011 to 2018 by using RNN model. Kumar et al. [31] constructed RNN model to forecast the counts of newly infected COVID-19 individuals, losses, and cures. In [32], RNN model was confirms to have a better predicting performance compared with LSTM and GRU models. Vicente et al. [33] applied RNN model to determine whether there was a correlation between the confirmed cases of dengue fever and climate variables.
Therefore, based on the above discussion, the impact of air pollutants (O 3 , PM 2.5 , PM 10 , SO 2 , CO, and NO 2 ) on the number of PTB cases in Urumqi was investigated by using the ARIMA model, a multivariate time series ARIMAX and RNN model. And, the best lag orders of the impact of each air pollutant on the PTB cases were determined by the cross-correlation test and spearman rank correlation test. In addition, a new comprehensive assessment index DISO [34,35], which can circumvent the contradiction of performance result (such as better consistency but worse bias for the same model), was applied to select the optimal one in these three models. Finally, this optimal one was used to predict the number of PTB cases in Urumqi from January to December 2019, which provides a theoretical basis for the prevention and control of PTB in Urumqi.

Model specification.
In this paper, it is assumed that both the response series {y t } and the series of input variables {x 1t }, {x 2t }, . . ., {x kt } are stationary, the regression model for the response series and the series of input variables is constructed as follows: where β 0 is the constant of this model, Θ i (B) is the p i -order auto-regressive polynomial of x it (i = 1, . . ., k), F i (B) is the q i -order moving average polynomial of x it (i = 1, . . ., k), B is the delay operator, l i is the delay order of {x it }, {ε t } is the series of regression residual and is stationary because both the response series and the input variables series are stationary. By using ARMA model to extract the relevant information in {ε t }, the following regression model can be established: where β 0 , (B), F i (B), B and l i have the same meaning as the above equation. Θ(B) is the moving average polynomial of {ε t }, F(B) is the auto-regressive polynomial of {ε t }, a t is a white noise series with the mean 0.

Model discernment, parameters estimation, and model diagnosis.
The stationarity of the response series (PTB cases series) and the input variable series (air pollutants series) were tested by the Augmented Dickey-Fuller (ADF) test. If they were non-stationary, the nonseasonal difference and seasonal difference methods were applied to stabilize the series. In addition, we identified parameters (p, q, P, and Q) to establish plausible models by referring to the auto-correlation function (ACF) and partial autocorrelation function (PACF) plots based on the stationary series. Firstly, we determined the seasonal part parameters (P and Q) and then nonseasonal part parameters (p and q) for the ARIMA model. Secondly, for the selected models, the least squares method was applied to estimate the parameters and the Ljung-Box test was applied to examine the residuals. Only when the residuals of the selected models were white noise, indicating that the model completely extracted information from the original data. Finally, the optimal ARIMA model was determined according to the lowest corrected Aiken's information criterion (AIC) and Bayesian information criterion (BIC) [23].

Inclusion of air pollutants.
The corresponding residual white noise sequence of each air pollutant variables was obtained by the optimal ARIMA model selected in subsection 2.2.2. And the optimal ARIMA model of each air pollutant variables were used as filter to obtain the residual white noise sequence of the PTB cases, so the pre-whitening process was completed [36]. Moreover, the best lag orders of the impact of each air pollutant on the PTB cases were determined by the cross-correlation function (CCF) of residual white noise. And those air pollutants variables (P < 0.05) which were significantly correlated with the PTB cases were included in the multivariate ARIMA model, it was mean that the ARIMAX models were constructed.

Recurrent neural network model
RNN could be used to describe the relationship between the current output of a sequence and the previous information, which usually consists of an input layer, a hidden layer, and an output layer. RNN is different from the traditional artificial neural network in that it adds connections between the neurons in the hidden layer based on layers fully connected. The unfolding diagram of the forward propagation of the RNN was shown in Fig 1, and the corresponding model is as follows: where x t represents input at time t, h t represents the corresponding hidden state at time t, U and W are the weight of x t and h t , respectively, O t represents output at time t, where V represents the weight of O t , f is any activation function. Therefore, the input of the RNN hidden layer includes not only the output of the input layer, but also the output of the upper time hidden layer. The data was divided into training set, testing set, and predicting set in a 6:2:2 ratio. In each RNN model, the learning rate was set to 0.05, 0.1, and 0.2 and the dimensions of the hidden layer to 3, 5, and 10, respectively, then the appropriate training epochs were identified through an epoch-error plot. Each RNN model was trained three times and the most appropriate parameters of each RNN model were determined. We used testing set to compare the performance of each model and determine the optimal RNN model. Firstly, the original data was normalized by the following formula, that is, all values were converted to the interval [0, 1], where X are the values of original data, X max is the maximum value of the original data, X min is the minimum value of the original data, and X 0 are the normalized values after conversion. Secondly, five different RNN models which did not incorporate air pollutants were constructed, by separately using the number of PTB cases in the previous month and the previous two, three, six, and twelve months as sequential inputs of the training set, and the number of PTB cases in the current month as the output of the training set. The performance of five RNN models were compared by using testing set and an optimal model was selected. Then, by Spearman rank correlation test, the correlations between PTB cases in the current month and air pollutants with a lag of 1 to 12 months were separately evaluated. Thirdly, those air pollutants (P < 0.05) which were significantly correlated with the PTB cases were incorporated into the optimal RNN model and the best lag order of the impact of each air pollutant on the PTB cases were determined. The optimal model incorporating air pollutants was finally determined, by using testing set to compare the performance of all RNN models.

Model assessment 2.4.1 MAPE and RMSPE criteria.
Prediction accuracy is an important criterion for evaluating forecasting validity. For such a reason, an error analysis based on two statistical measures, i.e. the Mean Absolute Percentage Error (MAPE) and Root Mean Square Percentage Error (RMSPE), is employed to estimate model performances and reliability [37]. The MAPE and the RMSPE are defined as where n is the number of data, x t andx t are the actual and forecast values at time t, respectively. The criteria of MAPE and RMSPE are shown in Table 1 [38,39], which was used to evaluate overall model performance. It is a merge of different statistical metrics including R, AE, and RMSE according to the distance between the simulated model and observed field in a three-dimension space coordinate system. DISO is defined as follows: DISO ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi q where R is Correlation coefficient, NAE and NRMSE are normalized AE and RMSE, respectively, its' formulars are as follows: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n i¼0 a i À � a ð Þ 2 q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n

Descriptive statistics of PTB cases
During the study period from January 2014 to December 2018 (60 months), a total of 14151 PTB cases were included, with the average of 2830 cases per year and the maximum of 4470 cases in 2014.The total number of annual PTB cases in 2015 was considerably lower than that in 2014 by 44%, and then the changing trend was relatively gentle. In Fig 2, it was shown that the number of PTB cases in Urumqi had an obvious seasonal pattern and a long-term trend of

PLOS ONE
Analyze the impact of air pollutants on the number of pulmonary tuberculosis cases in Urumqi, Xinjiang gradual decrease. The seasonal index (or called season exponent), which can reflect a stable relationship between the average number of new monthly PTB cases and the average number of new total PTB cases, with a peak in annual October and a valley in February of the next year. There were seasonal fluctuations and periodic trends of air pollutants in Urumqi, roughly showing the variation of single peak and single valley (S1 Fig). CO 2 , PM 2.5 , SO 2, and NO 2 had similar seasonal patterns, with higher values occurring from December to January of the next year. The peak of the O 3 occurred from May to June and the seasonal fluctuation of PM 10 is unstable. In addition, the median of CO, PM 2.5 , PM 10 , NO 2 , SO 2, and O 3 were 0.93 μg/m 3 , 44 μg/m 3 , 111 μg/m 3 , 12.5 μg/m 3 , 46 μg/m 3 , and 63.5 μg/m 3 , respectively (S1 Table).

Results of model discernment, parameters estimation, and model diagnosis
As shown in Fig 2, it was obvious that the series of PTB cases in Urumqi was non-stationary. ACF diagram and PACF diagram were obtained after first-order difference (see Fig 3). The ACF diagram showed that the ACF values fall into double standard deviation intervals after lagging 2 orders. In conclusion, the series of PTB cases after first-order difference had a shortterm correlation and it was stationary by the ADF test (ADF = −9.14, P < 0.05).
The model ARIMA (P, 1, q) (P, 0, q) 12 was preliminarily determined by the data characteristics of the number of PTB cases and the process of stabilization. Next, in order to choose the optimal model in a larger range, the analysis of ACF and PACF was performed and showed p, q, Q = 0,1 or 2, P = 0 or 1 (see Fig 3), so there was a total of 3 × 3 × 3 × 2 = 54 different choices.

PLOS ONE
Analyze the impact of air pollutants on the number of pulmonary tuberculosis cases in Urumqi, Xinjiang T-tests for the coefficients of 54 models and Box tests for the residuals [24] were separately implemented. Finally, 10 models passed the test and their goodness-of-fit evaluation results were provided in Table 2 by using AIC, BIC, and MAPE criteria.

PLOS ONE
Analyze the impact of air pollutants on the number of pulmonary tuberculosis cases in Urumqi, Xinjiang (see Table 2). The results of parameters estimation and white noise test of model ARIMA (1,1,2)×(0,0,1) 12 were separately shown in Tables 3 and 4, and all P-values were statistically significant (P < 0.05). As shown in Table 5, ARIMA models were developed for each air pollutant and the optimal models of each air pollutant were selected according to the AIC and BIC criteria, respectively.

The results of air pollutants inclusion
In order to investigate the correlation between PTB cases and each air pollutant at different lag times, we will find the best multivariate model. Hence, we considered all air pollutants (PM 2.5 , PM 10 , SO 2 , CO, NO 2, and O 3 ) as regression variables in the ARIMA (1,1,2)×(0,0,1) 12 model. As shown in Fig 4, there were significant correlations between PM 2.5 , PM 10 , NO 2 , SO 2 , CO, and the PTB cases, except for O 3 (see Fig 4D). More specifically, the monthly average of SO 2 at a lag of 6 months, the monthly average of PM 10 at a lag of 10 months and the monthly average of PM 2.5 at a lag of 12 months, the monthly average of NO 2 at a lag of 1 month or 5 months, the monthly average of CO at a lag of 3 months were significantly related to the number of PTB cases.
In the following, these five relative air pollutants SO 2 , PM 10 , PM 2.5 , NO 2, and CO were included in the multivariate ARIMA model to establish the corresponding ARIMAX models. Only three of the seven ARIMAX models passed the residual and parameter tests, and their AIC and MAPE values were calculated, respectively (see Table 6). As shown in Table 6, the values of AIC and MAPE of ARIMAX models included air pollutants were lower than the ARIMA model. In particular, ARIMAX (1,1,2)×(0,1,1) 12 +PM 2.5 with 12-month lag has the smallest AIC value (AIC = 479.32) and MAPE value (MAPE = 6.766%), which was the optimal ARIMAX model.  ARIMA (1,1,2)×(0,0,1)

RNN model
Firstly, the appropriate parameters of each RNN model were identified by comparing the MAPE values. It was found that RNN5 model had the smallest MAPE value (see Table 7), which implied RNN5 model was the optimal one. Apart from CO, other air pollutants in

PLOS ONE
different lag orders (O 3 , PM 2.5 , PM 10 , SO 2 , and NO 2 ) had significant correlations with the PTB cases (see Fig 5). Then, air pollutants O 3 , PM 2.5 , PM 10 , SO 2 , and NO 2 were incorporated into RNN5 model to construct other RNN models (RNN6~RNN10). As shown in Table 8, the smallest MAPE value in RNN6-RNN10 models separately were RNN6(RNN5+PM 10 (lag8)), RNN7(RNN5+SO 2 (lag8)), RNN8(RNN5+O 3 (lag7)), RNN9(RNN5+PM 2.5 (lag8)), RNN10 (RNN5+NO 2 (lag8)). Thirdly, comparing results of the 10 models in Tables 7 and 8 found that RNN9 (RNN5+PM 2.5 (lag8)) model was determined the optimal RNN model with the smallest MAPE (6.29%). As shown in Fig 6, the downward trend in epoch-error plots of RNN9 after three training cycles was no longer significant after reaching the set number of epochs, which indicated that the training epochs of RNN9 were appropriate. Fig 6 shows the plots of the training errors function of the PTB cases prediction model changing with the number of iterations, and it can be seen that RNN model tends to be stable (with the error value < 0.05) when the number of trainings reaching 600, which indicates that the prediction performance is better.

Results of ARIMA, ARIMAX, and RNN model assessment
As shown in Fig 7, the comprehensive accuracies of the ARIMA, ARIMAX and RNN models are quantitatively measured by the DISO with the values of 7.94, 1.45, and 2.01, respectively. It was implied that ARIMAX model was the optimal one superior to the ARIMA and RNN models. Therefore, in the following, ARIMAX model was applied to predict the PTB cases in Urumqi from January to December 2019.

Discussions
It is well known that air pollution is a global health threat. Although the bronchopulmonary tract has multiple protective mechanisms, air pollution can still harm acutely for respiratory system. Relevant results [40,41] have shown that the concentration of air pollutants has been linked with clinical manifestations of pulmonary diseases and it is associated with morbidity and mortality induced by respiratory diseases. In this paper, the impact of air pollutants (CO, PM 2.5 , PM 10 , SO 2 , O 3, and NO 2 ) on the number of PTB cases in Urumqi was investigated by using ARIMA, ARIMAX, and RNN models. The results of the cross-correlation analysis showed that apart from O 3 , other air pollutants (PM 2.5 , PM 10 , SO 2 , CO, and NO 2 ) all had a lagged effect on the PTB cases in Urumqi, which is consistent with the findings in [42]. Specifically, PM 2.5 had a lag (12 months) impact on the number of PTB cases in Urumqi. This may be due to the fact that PM 2.5 can enter the fine bronchi and alveoli of the lung through the respiratory tract, and increased secretion and susceptibility of the respiratory mucosa thereby leading to the obstruction of the mucus-cilia clearance mechanism. Another potential explanation might be that, when a large amount of PM 2.5 is inhaled into the lung through the respiratory tract, macrophages will produce a huge number of bioactive factors acting on PM 2.5 and release inflammatory factors to damage the tissue structure of the lung, which may result in inflammatory lesions in the lungs. Both processes are slow, which could lead to a lagged effect of PM 2.5 on the development of PTB. The result in [43] also showed that PM 2.5 has a certain chronic health risk for humans in Urumqi.
It can also be seen from the results of this paper that those three models (ARIMA, ARIMAX and RNN model) have different merits in data analysis. For example, ARIMA model is adept at identifying hidden trends (such as autocorrelation, and seasonal variation) in a dataset. ARIMA could capture behaviors of both stationary and non-stationary series and describe the linear relationship between disease incidence and predictors, but its predictive ability is limited by reliance on prior knowledge of parameters or inherent time-lag and it is not account for

PLOS ONE
additional factors which influence the occurrence and development of PTB. Different from ARIMA model, ARIMAX model could deal with multivariate time series data by adding other variables related to the PTB cases series to improve the prediction accuracy. However, the essence of ARIMA and ARIMAX is linear and it is insufficient to fit the complex multivariable dependencies, RNN model with a strong nonlinear fitting ability can overcome this limitation. Moreover, RNN retains more long-term sequence information and has memory to store the values that have been calculated.
This paper also has several limitations. Firstly, ARIMAX model is dependent on a large amount of historical data and requires the data to remain relatively stable, so as to achieve accurate and effective prediction. If the external factors suddenly change or new variables are introduced, the prediction effect of the model will be affected and thus the prediction performance will be reduced. In order to achieve more accurate prediction, ARIMAX model can be combined with differential equation models, regression analysis models, gray prediction models, artificial neural networks and other models to propose a combination model of time series analysis. Furthermore, the corresponding combined models can be built to obtain more accurate prediction by considering meteorological factors, economic factors, and other factors that have an impact on PTB.

Conclusion
In this paper, by using the ARIMA model, a multivariate time series ARIMAX and RNN model, the impact of air pollutants (O 3 , PM 2.5 , PM 10 , SO 2 , CO, and NO 2 ) on the number of PTB cases in Urumqi was investigated. It was found that ARIMAX model is obviously good in data fitting, superiorly to ARIMA and RNN models (especially from January 2014 to June 2015), which has also been confirmed by the result that ARIMAX model had the smallest DISO value by comparing with those of the other two models. Therefore, ARIMAX (1,1,2)×(0,1,1) 12 +PM 2.5 with 12-month lag was applied to predict the number of PTB cases from January to December 2019 in Urumqi. The predicted results of the ARIMAX model were in good agreement with the actual PTB cases, which presented that ARIMAX model had high accurate forecasting power and was applicable for predicting PTB cases in Urumqi. Moreover, the predicting results suggested that PTB cases declined obviously. It may be related to the comprehensive coverage of DOTS strategy and the implementation of universal health checkups in Urumqi, which make more PTB patients without discharge of bacterium to be earlier detected and diagnosed. Additionally, the centralized hospitalization of PTB patients in the infectious stage and the plan of "centralized medication + nutritional breakfast" for PTB patients have been carried out in Urumqi, which would effectively promote the recovery of PTB patients and reduce the spread of tuberculosis. A series of the adjustment of energy structure has improved air quality in Urumqi, such as "coal to gas conversion" and the "Blue Sky Project", which would reduce the emission of PM 2.5 , PM 10 , and other air pollutants thus decreasing the risk of PTB.